Data Science for Crime Scientists
| account | tweet_freq |
|---|---|
| bot | 33.7960 |
| bot | 34.5973 |
| bot | 34.2955 |
| bot | 32.2615 |
| real | 19.1904 |
| real | 14.4237 |
| real | 37.6143 |
| real | 29.0793 |
If A, then B: outcome and features.
Learning from data to make predictions about the future.
caret in practice

```r
my_first_model = glm(account ~ .
                     , data = data2
                     , family = 'binomial'
                     )
```

We have trained a model.
That is, you have taught an algorithm to predict real vs. bot accounts based on followers and tweet frequency.
```r
data2$model_predictions = predict(my_first_model, data2, type = 'response')
data2$model_1 = ifelse(data2$model_predictions >= .5, 'real', 'bot')
```

|  | bot | real |
|---|---|---|
| bot | 90 | 10 |
| real | 14 | 86 |
Think about what we did: we evaluated the model on the very same data we trained it on.
How can we solve this?
caret

```r
library(caret)
```

caret: data partitioning

```r
set.seed(1)
in_training = createDataPartition(y = data3$account
                                  , p = .8
                                  , list = FALSE
                                  )
training_data = data3[ in_training,]
test_data = data3[-in_training,]

my_second_model = train(account ~ .
                        , data = training_data
                        , method = "svmLinear"
                        )
model_predictions = predict(my_second_model, test_data)
```

|  | bot | real |
|---|---|---|
| bot | 19 | 1 |
| real | 4 | 16 |
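The partitioning step can be sketched in plain Python (shown only to illustrate the idea; the labels below are made up, and unlike `createDataPartition` this simple version does not stratify the split by class):

```python
import random

# Sketch of an 80/20 train/test split with made-up labels.
random.seed(1)

labels = ['bot'] * 100 + ['real'] * 100   # hypothetical dataset of 200 accounts
indices = list(range(len(labels)))
random.shuffle(indices)

cut = int(0.8 * len(indices))             # 80% of rows go to training
train_idx, test_idx = indices[:cut], indices[cut:]

print(len(train_idx), len(test_idx))  # 160 40
```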
Can we build a similar safeguard into the training data itself?
K-fold cross-validation
caret

```r
training_controls = trainControl(method = "cv"
                                 , number = 4
                                 , classProbs = TRUE
                                 )

my_third_model = train(account ~ .
                       , data = training_data
                       , trControl = training_controls
                       , method = "svmLinear"
                       )
my_third_model
```

```
## Support Vector Machines with Linear Kernel
##
## 160 samples
##   2 predictor
##   2 classes: 'bot', 'real'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 120, 120, 120, 120
## Resampling results:
##
##   Accuracy  Kappa
##   0.85      0.7
##
## Tuning parameter 'C' was held constant at a value of 1
```
```r
model_predictions = predict(my_third_model, test_data)
```

|  | bot | real |
|---|---|---|
| bot | 19 | 1 |
| real | 4 | 16 |
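What `trainControl(method = "cv", number = 4)` arranges can be sketched as follows (plain Python for illustration; the 160-row size is hypothetical, chosen to match the sample sizes printed by caret above):

```python
# 4-fold cross-validation: split the training rows into 4 disjoint folds,
# then train 4 times, each time holding one fold out for evaluation.
n, k = 160, 4
folds = [list(range(i, n, k)) for i in range(k)]  # 4 folds of 40 rows each

for held_out in range(k):
    eval_idx = folds[held_out]
    fit_idx = [i for f in range(k) if f != held_out for i in folds[f]]
    # fit the model on fit_idx (120 rows), evaluate on eval_idx (40 rows),
    # then average the k accuracies to get the cross-validated estimate

print(len(folds), len(fit_idx), len(eval_idx))  # 4 120 40
```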
Naive Bayes
\(P(A|B) = \frac{P(B|A)*P(A)}{P(B)}\)
In ML we are interested in:
\(P(outcome|data) = \frac{P(data|outcome)*P(outcome)}{P(data)}\)
Our scenario: text classification (normal vs spam email)
| id | outcome | “cash” | “rich” | “gmail” | “happy” |
|---|---|---|---|---|---|
| 1 | normal | 1 | 0 | 1 | 1 |
| 2 | normal | 0 | 0 | 1 | 0 |
| 3 | spam | 1 | 0 | 0 | 0 |
| 4 | spam | 1 | 1 | 0 | 0 |
\(P(spam|text) = \frac{P(text|spam)*P(spam)}{P(text)}\)
What we know (using the "naive" assumption that words are conditionally independent given the outcome):
\(P(text|spam) = P(word_1|spam) * P(word_2|spam) ... P(word_n|spam)\)
| id | outcome | “gmail” |
|---|---|---|
| 1 | normal | 1 |
| 2 | normal | 1 |
| 3 | spam | 0 |
| 4 | spam | 0 |
\(P('gmail'|spam) = \frac{0}{2} = 0.00\)
Results in zero for the whole of:
\(P(text|spam) = P(word_1|spam) * P(word_2|spam) ... P(word_n|spam)\)
\(P('gmail'|spam) = \frac{'gmail'_{spam} + \alpha}{'gmail'_{total} + \alpha*k}\)
For: \(\alpha = 1\) and \(k = 2\) (= no. of outcomes):
\(P('gmail'|spam) = \frac{1}{4} = 0.25\)
For any new text, we can use \(P(spam|text)\) to estimate whether it is spam or not.
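The smoothing arithmetic above can be checked directly (Python used purely as a calculator; the counts come from the toy table: 'gmail' occurs 0 times in spam emails and 2 times in total):

```python
# Laplace smoothing: add alpha to the word count in the class, and
# alpha * k (k = number of outcome classes) to the denominator.
gmail_spam, gmail_total = 0, 2
alpha, k = 1, 2

p_unsmoothed = gmail_spam / gmail_total                       # zeroes out the whole product
p_smoothed = (gmail_spam + alpha) / (gmail_total + alpha * k)

print(p_unsmoothed, p_smoothed)  # 0.0 0.25
```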
Source: chrisalbon.com
```r
svm_model = train(account ~ .
                  , data = training_data
                  , trControl = training_controls
                  , method = "svmLinear"
                  )

nb_model = train(account ~ .
                 , data = training_data
                 , trControl = training_controls
                 , method = "nb"
                 )
```

|  | bot | real |
|---|---|---|
| bot | 96 | 104 |
| real | 102 | 98 |
|  | bot | real |
|---|---|---|
| bot | 124 | 76 |
| real | 140 | 60 |
|  | Bot | Real |
|---|---|---|
| Bot | True positives | False negatives |
| Real | False positives | True negatives |
\(acc=\frac{(TP+TN)}{N}\)
\(acc_{svm}=\frac{(96+98)}{400} = 0.49\)

\(acc_{nb}=\frac{(124+60)}{400} = 0.46\)
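The accuracies can be recomputed from the two confusion matrices shown above (rows = reality, columns = prediction; Python used purely as a calculator):

```python
# Confusion matrices as (reality, prediction) -> count.
svm = {('bot', 'bot'): 96,  ('bot', 'real'): 104,
       ('real', 'bot'): 102, ('real', 'real'): 98}
nb = {('bot', 'bot'): 124, ('bot', 'real'): 76,
      ('real', 'bot'): 140, ('real', 'real'): 60}

def accuracy(cm):
    # correct predictions sit on the diagonal
    correct = cm[('bot', 'bot')] + cm[('real', 'real')]
    return correct / sum(cm.values())

print(accuracy(svm), accuracy(nb))  # 0.485 0.46
```

The 0.485 rounds to the 0.49 reported for the SVM.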
Any problems with that?
|  | Bot | Real |
|---|---|---|
| Bot | 100 | 100 |
| Real | 5 | 195 |
|  | Bot | Real |
|---|---|---|
| Bot | 150 | 50 |
| Real | 55 | 145 |
Needed: more nuanced metrics
```
##        prediction
## reality Bot Real Sum
##    Bot  100  100 200
##    Real   5  195 200
##    Sum  105  295 400
```

```
##        prediction
## reality Bot Real Sum
##    Bot  150   50 200
##    Real  55  145 200
##    Sum  205  195 400
```
i.e. how often the prediction is correct when the model predicts class X
Note: we have two classes, so we get two precision values
Formally:

\(Pr_{X} = \frac{TP_X}{TP_X + FP_X}\)

```
##        prediction
## reality Bot Real Sum
##    Bot  100  100 200
##    Real   5  195 200
##    Sum  105  295 400
```
|  | Model 1 | Model 2 |
|---|---|---|
| \(acc\) | 0.74 | 0.74 |
| \(Pr_{bot}\) | 0.95 | 0.73 |
| \(Pr_{real}\) | 0.66 | 0.74 |
i.e. how many instances of class X are detected
Note: we have two classes, so we get two recall values
Also called sensitivity and specificity!
Formally:

\(R_{X} = \frac{TP_X}{TP_X + FN_X}\)

```
##        prediction
## reality Bot Real Sum
##    Bot  100  100 200
##    Real   5  195 200
##    Sum  105  295 400
```
|  | Model 1 | Model 2 |
|---|---|---|
| \(acc\) | 0.74 | 0.74 |
| \(Pr_{bot}\) | 0.95 | 0.73 |
| \(Pr_{real}\) | 0.66 | 0.74 |
| \(R_{bot}\) | 0.50 | 0.75 |
| \(R_{real}\) | 0.98 | 0.73 |
The F1 measure.
Note: we combine Pr and R for each class, so we get two F1 measures.
Formally:

\(F1_X = \frac{2 \cdot Pr_X \cdot R_X}{Pr_X + R_X}\)

```
##        prediction
## reality Bot Real Sum
##    Bot  100  100 200
##    Real   5  195 200
##    Sum  105  295 400
```
|  | Model 1 | Model 2 |
|---|---|---|
| \(acc\) | 0.74 | 0.74 |
| \(Pr_{bot}\) | 0.95 | 0.73 |
| \(Pr_{real}\) | 0.66 | 0.74 |
| \(R_{bot}\) | 0.50 | 0.75 |
| \(R_{real}\) | 0.98 | 0.73 |
| \(F1_{bot}\) | 0.66 | … |
| \(F1_{real}\) | 0.79 | … |
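The per-class metrics for Model 1 can be reproduced from its confusion matrix (Python used purely as a calculator; precision divides by the column total, recall by the row total):

```python
# Model 1 confusion matrix as (reality, prediction) -> count.
cm = {('Bot', 'Bot'): 100, ('Bot', 'Real'): 100,
      ('Real', 'Bot'): 5,  ('Real', 'Real'): 195}
classes = ['Bot', 'Real']

metrics = {}
for c in classes:
    predicted_c = sum(cm[(r, c)] for r in classes)  # column total: everything predicted as c
    actual_c = sum(cm[(c, p)] for p in classes)     # row total: everything truly c
    precision = cm[(c, c)] / predicted_c
    recall = cm[(c, c)] / actual_c
    f1 = 2 * precision * recall / (precision + recall)
    metrics[c] = (precision, recall, f1)

for c, (pr, r, f1) in metrics.items():
    print(c, pr, r, f1)
```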
```r
confusionMatrix(nb_pred, as.factor(test_data$account))
```

```
## Confusion Matrix and Statistics
##
##           Reference
## Prediction bot real
##       bot  124  140
##       real  76   60
##
##                Accuracy : 0.46
##                  95% CI : (0.4104, 0.5102)
##     No Information Rate : 0.5
##     P-Value [Acc > NIR] : 0.9506
##
##                   Kappa : -0.08
##  Mcnemar's Test P-Value : 1.814e-05
##
##             Sensitivity : 0.6200
##             Specificity : 0.3000
##          Pos Pred Value : 0.4697
##          Neg Pred Value : 0.4412
##              Prevalence : 0.5000
##          Detection Rate : 0.3100
##    Detection Prevalence : 0.6600
##       Balanced Accuracy : 0.4600
##
##        'Positive' Class : bot
```
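A few of the printed statistics can be re-derived by hand from the 2x2 table (positive class = 'bot'; Python used purely as a calculator):

```python
# Counts from caret's confusionMatrix output above.
tp, fp = 124, 140   # predicted bot: reference bot / reference real
fn, tn = 76, 60     # predicted real: reference bot / reference real

sensitivity = tp / (tp + fn)                 # recall on 'bot'
specificity = tn / (fp + tn)                 # recall on 'real'
accuracy = (tp + tn) / (tp + fp + fn + tn)
balanced_accuracy = (sensitivity + specificity) / 2

print(sensitivity, specificity, accuracy)  # 0.62 0.3 0.46
```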
What’s behind the model’s predictions?
Needed: a representation across all possible values
```r
# for each class probability:
threshold_1 = probs[1]
threshold_1
## [1] 0.4822156
pred_threshold_1 = ifelse(probs >= threshold_1, 'bot', 'real')
```

|  | bot | real |
|---|---|---|
| bot | 183 | 17 |
| real | 188 | 12 |
\(Sens. = 183/200 = 0.92\)
\(Spec. = 12/200 = 0.06\)
| Threshold | Sens. | 1-Spec |
|---|---|---|
| 0.48 | 0.92 | 0.94 |
Do this for every threshold observed.
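That sweep can be sketched as follows (plain Python with made-up scores and labels; in the slides, `roc()` from the pROC package performs this for you):

```python
# ROC construction: use every observed probability as a threshold and
# record (1 - specificity, sensitivity) at each one.
labels = ['bot', 'bot', 'bot', 'real', 'real', 'real']
probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]   # hypothetical P(bot) per account

n_bot = labels.count('bot')
n_real = labels.count('real')

points = []
for t in sorted(set(probs), reverse=True):
    pred = ['bot' if p >= t else 'real' for p in probs]
    tp = sum(y == 'bot' and yhat == 'bot' for y, yhat in zip(labels, pred))
    fp = sum(y == 'real' and yhat == 'bot' for y, yhat in zip(labels, pred))
    points.append((fp / n_real, tp / n_bot))  # (1 - spec, sens)

print(points)
```

Plotting these points (1 - specificity on the x-axis, sensitivity on the y-axis) gives the ROC curve; the area under it is the AUC.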
```r
library(pROC)

auc1 = roc(response = test_data$account
           , predictor = probs
           , ci = T)
auc1
```

```
## Call:
## roc.default(response = test_data$account, predictor = probs, ci = T)
##
## Data: probs in 200 controls (test_data$account bot) < 200 cases (test_data$account real).
## Area under the curve: 0.5534
## 95% CI: 0.4971-0.6098 (DeLong)
```
Next week: Machine Learning 2